Ford Go Bike Dataset

by Japheth Rutoh

Investigation Overview

This investigation focuses on the duration of bike rides and how it varies by day of the week, user type, age, and gender.

Dataset Overview

The dataset contains information about bike share rides in the San Francisco Bay Area during February 2019.
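The plots in the following sections use the columns `duration_min`, `day_of_week`, and `age`, whose derivation is not shown here. A minimal sketch of how they might be derived, assuming the raw data has `duration_sec`, `start_time`, and `member_birth_year` columns (these names, the helper `add_derived_columns`, and the toy rows are assumptions for illustration only):

```python
import pandas as pd

def add_derived_columns(trips: pd.DataFrame, reference_year: int = 2019) -> pd.DataFrame:
    """Return a copy of `trips` with the derived columns used in the plots."""
    out = trips.copy()
    # Ride duration in minutes (the raw column is assumed to be in seconds).
    out["duration_min"] = out["duration_sec"] / 60
    # Name of the day on which the ride started, e.g. "Monday".
    out["day_of_week"] = pd.to_datetime(out["start_time"]).dt.day_name()
    # Approximate rider age in the reference year.
    out["age"] = reference_year - out["member_birth_year"]
    return out

# Tiny made-up frame, only to show the shape of the result.
toy = pd.DataFrame({
    "duration_sec": [600, 1800],
    "start_time": ["2019-02-01 08:15:00", "2019-02-03 14:30:00"],
    "member_birth_year": [1990, 1975],
})
ford_toy = add_derived_columns(toy)
print(ford_toy[["duration_min", "day_of_week", "age"]])
```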

Distribution of Duration

The main feature for our analysis is the bike ride duration.

The logarithmic plot of the duration distribution shows that it is bimodal, with peaks at about 5 and 10 minutes.

In [3]:
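# log-spaced bin edges from 1 minute up to the longest ride, in steps of 0.1 in log10 units
bins = 10 ** np.arange(0, np.log10(ford['duration_min'].max())+0.1, 0.1)
# tick positions that are roughly evenly spaced on the log axis
ticks = [1, 3, 10, 30, 100, 300]
labels = [f'{v}' for v in ticks]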

plt.figure(figsize=[15, 5])
plt.hist(data = ford, x = 'duration_min', bins = bins)
plt.xscale('log')
plt.xticks(ticks, labels)
plt.xlabel('Duration (min)')
plt.ylabel('Number of Rides')
plt.title('Log Distribution of Duration');
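To make the binning above concrete, here is a small sketch of how the log-spaced edges behave. The maximum duration used here is a made-up stand-in for `ford['duration_min'].max()`, not a value from the dataset.

```python
import numpy as np

max_duration = 1400.0  # hypothetical maximum ride duration in minutes
bins = 10 ** np.arange(0, np.log10(max_duration) + 0.1, 0.1)

# Each edge is a constant factor (10**0.1, about 1.26) larger than the previous
# one, so the bins have equal width on a log-scaled x-axis.
ratios = bins[1:] / bins[:-1]
print(bins[:4].round(2))
print(len(bins), bins[-1].round(1))
```

Because the first edge is 10**0 = 1, any ride shorter than one minute would fall outside the histogram with this choice of bins.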

Number of Bike Rides by Weekday

Thursday has the highest number of rides of any day of the week.

In [4]:
# use a single color for all bars
base_color = sns.color_palette()[0]
# set the order of days of the week
day_order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
# distribution of days of the week
plt.figure(figsize=(15,6))
sns.countplot(data=ford,x='day_of_week',color=base_color,order=day_order)
plt.xlabel('Day of Week')
plt.ylabel('Count')
plt.title('Number of Rides by Day of the Week');

Weekly Bike Ride Duration by User Type

Customers have longer ride durations on average than Subscribers.

As the weekend approaches, the average ride duration rises only slightly for Subscribers, whereas it spikes for Customers.

The average duration for both Subscribers and Customers increases during the weekend, even though the number of bike rides is lower than on weekdays.

In [5]:
g = sns.catplot(data=ford,x='day_of_week',col='user_type',y='duration_min',kind='bar',
                sharey=False,
                color=base_color,
                order=day_order)
g.fig.suptitle('Average Weekly Bike Duration by User Type',y=1.03,fontsize=13)
g.set_titles('{col_name}')
g.set_ylabels('Duration (min)')
g.set_xlabels('Day of Week')
g.set_xticklabels(rotation=45);

Duration vs Age by User Type

Subscribers have more data points at the lower durations, and these are spread across ages between 20 and 80 years.

The distributions for Customers and Subscribers are similar, except that the Subscriber facet has more data points.

In [6]:
px.scatter(x='age',data_frame=ford,y='duration_min', color='user_type',
           symbol='user_type',symbol_sequence=['circle', 'circle'],opacity=0.5,facet_col='user_type',
           height=500,
           title='Duration Age ScatterPlot by User Type',
           labels={'user_type':'User Type',
                   'age':'Age',
                   'duration_min':'Duration (min)'})

Duration vs. User Type by Gender

For both male and female Customers, ride durations are spread widely between the minimum duration and the upper fence of the box plot.

Subscriber rides are more tightly clustered than Customer rides.

The y-axis is limited to the range 0 to 60 minutes because most of the data falls within it.

In [13]:
# plot a random sample of 3,000 rides (drawn without replacement) because the full dataset is large
sample = np.random.choice(ford.shape[0],3000,replace=False)
sample_df = ford.iloc[sample,:]

px.box(data_frame=sample_df,x='user_type',y='duration_min',color='member_gender',range_y=[0,60],height=500,
       labels={'duration_min':'Duration (min)',
               'user_type':'User Type',
               'member_gender': 'Gender'},
       title='Duration User Type Boxplot'
)
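The "upper fence" mentioned above is conventionally defined as Q3 + 1.5 × IQR. A minimal sketch of that calculation per user type, using made-up durations rather than the real data (the helper `upper_fence` is a hypothetical name):

```python
import pandas as pd

def upper_fence(durations: pd.Series) -> float:
    """Tukey upper fence: Q3 + 1.5 * IQR."""
    q1 = durations.quantile(0.25)
    q3 = durations.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)

# Made-up durations in minutes, only to illustrate the calculation.
toy = pd.DataFrame({
    "user_type": ["Customer"] * 4 + ["Subscriber"] * 4,
    "duration_min": [8.0, 12.0, 20.0, 40.0, 5.0, 7.0, 9.0, 11.0],
})
fences = toy.groupby("user_type")["duration_min"].apply(upper_fence)
print(fences)
```

The whisker drawn in the box plot ends at the largest observation that does not exceed this limit, and the plotting library may compute quartiles with a slightly different method, so the plotted values can differ a little from this calculation.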

Lineplot for the average bike ride durations during the week

The average bike ride duration plot shows that Customers spend more time riding than Subscribers. The average duration for Customers also spikes as the weekend approaches, whereas for Subscribers it rises only slightly.

In [9]:
px.line(week_df_avg,x='day_of_week',y='duration_min',color='user_type',line_shape='spline',
        labels = {'user_type':'User Type',
                  'duration_min':'Duration (min)',
                  'day_of_week':'Day of Week'}
        ).update_layout(
    title ='Average Bike Ride Duration during the Week'
)
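The `week_df_avg` table used in this plot is not defined in this section. One way it could be built, assuming it holds the mean `duration_min` per user type and day of the week (the helper `weekly_average` and the toy rows are hypothetical):

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def weekly_average(trips: pd.DataFrame) -> pd.DataFrame:
    """Mean ride duration for each (user_type, day_of_week) pair."""
    avg = trips.groupby(["user_type", "day_of_week"], as_index=False)["duration_min"].mean()
    # Order the days Monday to Sunday so the line is drawn in calendar order.
    avg["day_of_week"] = pd.Categorical(avg["day_of_week"],
                                        categories=day_order, ordered=True)
    return avg.sort_values(["user_type", "day_of_week"]).reset_index(drop=True)

# Made-up rides, only to show the shape of the result.
toy = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Subscriber", "Subscriber", "Subscriber"],
    "day_of_week": ["Sunday", "Sunday", "Monday", "Monday", "Sunday"],
    "duration_min": [30.0, 50.0, 10.0, 12.0, 15.0],
})
week_df_avg_toy = weekly_average(toy)
print(week_df_avg_toy)
```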

Lineplot for the daily total bike ride duration during the week

The daily total bike ride duration is much lower for Customers than for Subscribers. Since Customers ride longer on average, this large gap in total minutes suggests that Customers take far fewer rides than Subscribers.

The total duration for Subscribers drops off at the start of the weekend.

In [10]:
px.line(week_df_size,x='day_of_week',y='duration_min',color='user_type',line_shape='spline',
        labels = {'user_type':'User Type',
                  'duration_min':'Duration (min)',
                  'day_of_week':'Day of Week'}
        ).update_layout(
    title ='Bike Ride Duration during the Week'
)
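The contents of `week_df_size` are not shown in this section. The sketch below assumes the plotted value is the per-day sum of `duration_min`, and uses made-up rides to illustrate why a lower total combined with a higher average points to fewer rides (the helper `weekly_summary` and its output column names are hypothetical):

```python
import pandas as pd

def weekly_summary(trips: pd.DataFrame) -> pd.DataFrame:
    """Total minutes, ride count, and mean minutes per (user_type, day_of_week)."""
    return (
        trips.groupby(["user_type", "day_of_week"])
        .agg(total_min=("duration_min", "sum"),
             rides=("duration_min", "size"),
             mean_min=("duration_min", "mean"))
        .reset_index()
    )

# Made-up rides: Customers ride longer, but there are fewer of them.
toy = pd.DataFrame({
    "user_type": ["Customer"] * 2 + ["Subscriber"] * 6,
    "day_of_week": ["Saturday"] * 8,
    "duration_min": [30.0, 50.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
})
summary = weekly_summary(toy).set_index("user_type")
print(summary)
```

In each row the total equals the ride count times the mean, so a group with a higher mean can only have a lower total if it has fewer rides.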